RFC: Introduce pandas.col
#62103
When this is added, and then released,
One comment is that I'm not sure it will support some basic arithmetic, such as:
result = df.assign(addcon=pd.col("a") + 10)
Or alignment with other series:
b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)
Also, don't you need to add some tests?
Thanks for taking a look!
Yup, they're both supported:
In [8]: df = pd.DataFrame({'a': [1,2,3]})
In [9]: s = pd.Series([90,100,110], index=[2,1,0])
In [10]: df.assign(
...: b=pd.col('a')+10,
...: c=pd.col('a')+s,
...: )
Out[10]:
a b c
0 1 11 111
1 2 12 102
2 3 13 93
😄
Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change.
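For comparison, here is a minimal sketch of how the same two columns can be written with callables, which `assign` already accepts today. It reuses the `df` and `s` from the session above and does not use `pd.col` at all; the variable name `result` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
s = pd.Series([90, 100, 110], index=[2, 1, 0])

# Each callable receives the DataFrame being built and returns a Series.
result = df.assign(
    b=lambda df: df["a"] + 10,  # scalar arithmetic
    c=lambda df: df["a"] + s,   # aligned on the index, not on position
)
print(result)
```

The values match the `pd.col` output shown above, since in both spellings the addition with `s` aligns on index labels.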
I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using
Is assign the main use case?
Currently it would only work in places that accept Getting it to work in
xref @jbrockmendel's comment #56499 (comment)
I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023)
Demo:
Output:
Repr demo:
What's here should be enough for it to be usable. For the type hints to show up correctly, extra work should be done in `pandas-stubs`. But I think it should be possible to develop tooling to automate the `Expr` docs and types based on the `Series` ones (going to cc @Dr-Irv here too then).
As for the `col` name, that's what PySpark, Polars, Daft, and Datafusion use, so I think it'd make sense to follow the convention.
I'm opening as a request for comments. Would people want this API to be part of pandas?
One of my main motivations for introducing it is that it avoids common issues with scoping. For example, if you use `assign` to increment two columns' values by 10 and try to write
df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
then you'll be in for a big surprise, whereas with `pd.col`, you get what you were probably expecting:
Further advantages:
<function __main__.<lambda>(df)
Expected objections:
TODO:
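To make the scoping motivation above concrete, here is a minimal sketch that reproduces the surprise with plain lambdas and shows the usual default-argument workaround. The sample frame and the names `surprising` and `fixed` are assumptions chosen for illustration; they are not taken from the PR.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Late binding: every lambda looks up `col` only when it is called, by which
# time the comprehension has finished and `col` is "b" for both of them.
surprising = df.assign(**{col: lambda df: df[col] + 10 for col in ("a", "b")})
print(surprising)  # both columns end up as b + 10

# Workaround without pd.col: bind the current value of `col` as a default
# argument, so each lambda keeps its own column name.
fixed = df.assign(**{col: lambda df, col=col: df[col] + 10 for col in ("a", "b")})
print(fixed)
```

Under the proposal, `pd.col(col) + 10` would receive the column name as an ordinary argument when the expression is built, which is why it would avoid the late-binding trap without the `col=col` trick.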